{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "punL79CN7Ox6"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_ckMIh7O7s6D"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9uBA1i1BbiJn"
      },
      "source": [
        "# Word Embeddings and Sentiment"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "S5Uhzt6vVIB2"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c04_nlp_embeddings_and_sentiment.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l09c04_nlp_embeddings_and_sentiment.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1iLe4E0dB7tj"
      },
      "source": [
        "In this colab, you'll work with word embeddings and train a basic neural network to predict text sentiment. At the end, you'll be able to visualize how the network sees the related sentiment of each word in the dataset."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "wqvz1jVgbwIN"
      },
      "source": [
        "## Import TensorFlow and related functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "XIG52aKPdpux"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "\n",
        "from tensorflow.keras.preprocessing.text import Tokenizer\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pazU5OmxehIA"
      },
      "source": [
        "## Get the dataset\n",
        "\n",
        "We're going to use a dataset containing Amazon and Yelp reviews, with their related sentiment (1 for positive, 0 for negative). This dataset was originally extracted from [here](https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "qpwQT2E9ez5B"
      },
      "outputs": [],
      "source": [
        "!wget --no-check-certificate \\\n",
        "    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "6Zvp9NScfMnw"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "dataset = pd.read_csv('/tmp/sentiment.csv')\n",
        "\n",
        "sentences = dataset['text'].tolist()\n",
        "labels = dataset['sentiment'].tolist()\n",
        "\n",
        "# Separate out the sentences and labels into training and test sets\n",
        "training_size = int(len(sentences) * 0.8)\n",
        "\n",
        "training_sentences = sentences[0:training_size]\n",
        "testing_sentences = sentences[training_size:]\n",
        "training_labels = labels[0:training_size]\n",
        "testing_labels = labels[training_size:]\n",
        "\n",
        "# Make labels into numpy arrays for use with the network later\n",
        "training_labels_final = np.array(training_labels)\n",
        "testing_labels_final = np.array(testing_labels)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NHpvqaSigcST"
      },
      "source": [
        "## Tokenize the dataset\n",
        "\n",
        "Tokenize the dataset, including padding and OOV"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "78icewYRgfxh"
      },
      "outputs": [],
      "source": [
        "vocab_size = 1000\n",
        "embedding_dim = 16\n",
        "max_length = 100\n",
        "trunc_type='post'\n",
        "padding_type='post'\n",
        "oov_tok = \"<OOV>\"\n",
        "\n",
        "\n",
        "from tensorflow.keras.preprocessing.text import Tokenizer\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
        "\n",
        "tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n",
        "tokenizer.fit_on_texts(training_sentences)\n",
        "word_index = tokenizer.word_index\n",
        "sequences = tokenizer.texts_to_sequences(training_sentences)\n",
        "padded = pad_sequences(sequences,maxlen=max_length, padding=padding_type, \n",
        "                       truncating=trunc_type)\n",
        "\n",
        "testing_sequences = tokenizer.texts_to_sequences(testing_sentences)\n",
        "testing_padded = pad_sequences(testing_sequences,maxlen=max_length, \n",
        "                               padding=padding_type, truncating=trunc_type)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "q4yIEk_8kszh"
      },
      "source": [
        "## Review a Sequence\n",
        "\n",
        "Let's quickly take a look at one of the padded sequences to ensure everything above worked appropriately."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "JTU3FmVGk100"
      },
      "outputs": [],
      "source": [
        "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
        "\n",
        "def decode_review(text):\n",
        "    return ' '.join([reverse_word_index.get(i, '?') for i in text])\n",
        "\n",
        "print(decode_review(padded[1]))\n",
        "print(training_sentences[1])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RI91liJnlA92"
      },
      "source": [
        "## Train a Basic Sentiment Model with Embeddings"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "bBMgzp-_lMTp"
      },
      "outputs": [],
      "source": [
        "# Build a basic sentiment network\n",
        "# Note the embedding layer is first, \n",
        "# and the output is only 1 node as it is either 0 or 1 (negative or positive)\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Flatten(),\n",
        "    tf.keras.layers.Dense(6, activation='relu'),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Pfl1W-zVldpn"
      },
      "outputs": [],
      "source": [
        "num_epochs = 10\n",
        "model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GjMZ4ZFQl_48"
      },
      "source": [
        "## Get files for visualizing the network\n",
        "\n",
        "The code below will download two files for visualizing how your network \"sees\" the sentiment related to each word. Head to http://projector.tensorflow.org/ and load these files, then click the \"Sphereize\" checkbox."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "S2lB46FirAVx"
      },
      "outputs": [],
      "source": [
        "# First get the weights of the embedding layer\n",
        "e = model.layers[0]\n",
        "weights = e.get_weights()[0]\n",
        "print(weights.shape) # shape: (vocab_size, embedding_dim)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Xcha0oGemHX2"
      },
      "outputs": [],
      "source": [
        "import io\n",
        "\n",
        "# Write out the embedding vectors and metadata\n",
        "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
        "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
        "for word_num in range(1, vocab_size):\n",
        "  word = reverse_word_index[word_num]\n",
        "  embeddings = weights[word_num]\n",
        "  out_m.write(word + \"\\n\")\n",
        "  out_v.write('\\t'.join([str(x) for x in embeddings]) + \"\\n\")\n",
        "out_v.close()\n",
        "out_m.close()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "g-Q6ALywmWVz"
      },
      "outputs": [],
      "source": [
        "# Download the files\n",
        "try:\n",
        "  from google.colab import files\n",
        "except ImportError:\n",
        "  pass\n",
        "else:\n",
        "  files.download('vecs.tsv')\n",
        "  files.download('meta.tsv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GNoxfY-i3Ir1"
      },
      "source": [
        "## Predicting Sentiment in New Reviews\n",
        "\n",
        "Now that you've trained and visualized your network, take a look below at how we can predict sentiment in new reviews the network has never seen before."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "QXtfw-OY3WoZ"
      },
      "outputs": [],
      "source": [
        "# Use the model to predict a review   \n",
        "fake_reviews = ['I love this phone', 'I hate spaghetti', \n",
        "                'Everything was cold',\n",
        "                'Everything was hot exactly as I wanted', \n",
        "                'Everything was green', \n",
        "                'the host seated us immediately',\n",
        "                'they gave us free chocolate cake', \n",
        "                'not sure about the wilted flowers on the table',\n",
        "                'only works when I stand on tippy toes', \n",
        "                'does not work when I stand on my head']\n",
        "\n",
        "print(fake_reviews) \n",
        "\n",
        "# Create the sequences\n",
        "padding_type='post'\n",
        "sample_sequences = tokenizer.texts_to_sequences(fake_reviews)\n",
        "fakes_padded = pad_sequences(sample_sequences, padding=padding_type, maxlen=max_length)           \n",
        "\n",
        "print('\\nHOT OFF THE PRESS! HERE ARE SOME NEWLY MINTED, ABSOLUTELY GENUINE REVIEWS!\\n')              \n",
        "\n",
        "classes = model.predict(fakes_padded)\n",
        "\n",
        "# The closer the class is to 1, the more positive the review is deemed to be\n",
        "for x in range(len(fake_reviews)):\n",
        "  print(fake_reviews[x])\n",
        "  print(classes[x])\n",
        "  print('\\n')\n",
        "\n",
        "# Try adding reviews of your own\n",
        "# Add some negative words (such as \"not\") to the good reviews and see what happens\n",
        "# For example:\n",
        "# they gave us free chocolate cake and did not charge us"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "l09c04_nlp_embeddings_and_sentiment.ipynb",
      "private_outputs": true,
      "provenance": [
        {
          "file_id": "1iq_tSJPWWfj2tCp3Z3ovpcFT9QuUlwgr",
          "timestamp": 1585785880918
        }
      ],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
